ABSTRACT

Action research often necessitates the use of intact
groups for the comparison of educational treatments or programs. This
paper considers several analytical methods that might be used for
such situations when pretest scores indicate that these intact groups
differ significantly initially. The methods considered include gain
score analysis of variance (ANOVA), analysis of covariance (ANCOVA)
(using both raw scores and estimated true scores), value-added
analysis, and within-group dependent t-tests, all on a common set of
real data from nonequivalent intact groups. Seemingly contradictory
results were obtained for these data with gain score ANOVA and with
ANCOVA. Comparable results should be expected to occur routinely with
data from nonequivalent groups. In view of these results, it is
recommended that statistical comparisons across nonequivalent groups
be avoided. However, within-group comparisons may aid somewhat in
such evaluations of alternative educational programs. (Author/RC)

Strategies for Analyzing Data from Intact Groups
Lawrence H. Cross and Carolyn E. Lane
Virginia Polytechnic Institute and State University



The term "action research" suggests research which is responsive to the
immediate needs of a decision maker in a particular setting. As such, time
and administrative obstacles may argue for the use of extant groups to
compare two or more treatments or programs. Since random assignment of
subjects to groups is not possible, such a research design would be considered
a quasi-experimental, non-equivalent control group design in the Campbell
and Stanley (1963) taxonomy. With such designs, it is advisable to pretest
the subjects to determine the extent to which the intact groups differ with
respect to the variables under study. If the mean pretest scores for the
groups do not differ significantly, one may wish to assume that the groups
were, in effect, randomly formed and proceed with an analysis appropriate
for a true experiment, including any of those which follow.¹ If, however,
the groups differ significantly on the pretest, indicating the groups are
not likely to represent random samples from a common population, there is
little agreement regarding how such data should be treated. The purpose of
this paper is to consider a number of analytic methods that might be used
with such data. In order to facilitate comparisons, each analysis reported
below was carried out using a common set of real data. Pre- and posttest
scores from the reading subtests of the Metropolitan Achievement Tests,
Primary Level II, were obtained for children in each of three intact groups.
Each group was instructed using a different reading program during the
course of the academic year.² The tests were administered in the fall and
the spring to the children in all groups. Pretest scores were not available
when the groups were formed. Even though the Metropolitan provides three
subtest scores for reading (word knowledge, word analysis, and reading), only
the total reading scores were used in the analysis reported here. The total
reading scores are obtained by summing the number of correct responses across
the three subtests. Ordinarily, a multivariate analysis of the subtest scores
would be preferred, but univariate analyses using total reading scores are
reported in this paper to facilitate the discussion. In practice, the
parsimony achieved by the use of a single composite score is gained at the
expense of diagnostic information that a multivariate analysis of the
subtests would have afforded.

¹ Note that failure to reject a null hypothesis does not imply the truth of
the null. Moreover, the groups may differ considerably with respect to some
unmeasured but relevant variable. Consequently, this strategy is a poor
substitute for random assignment to groups. At issue is whether the groups
can be considered equivalent or non-equivalent in both a statistical and a
practical sense.


ANOVA on Gain Scores 

Perhaps one of the most obvious analyses for data of this type is to
compare the raw gain scores across groups. While it is true that gain scores
tend to be highly unreliable, this characteristic of gain scores is of
greatest concern when gain scores are to be used in a correlational study. The
unreliability of gain scores has been shown not to be a valid concern when
the interest is to compare differences between experimental treatment groups
(Overall and Woodward, 1975).

² The writers wish to express their appreciation to Dr. Rose Sabaroff for
providing us the data for the analyses reported in this paper.

Table 1 presents the means and standard deviations of the pretest,
posttest, and gain scores earned by the three groups. An analysis of variance
using the gain scores revealed that the differences are significant (p < .01)
and a Newman-Keuls post-hoc test indicated that the pairwise differences
among all three groups were also significant (p < .01). If one were to
present results such as these to a decision maker who is not well versed in
the ways of gain scores, he might well decide against the programs used with
groups I and III and choose the program used with group II. You might feel
obliged, as an action researcher, to explain that the smaller gains observed
for group I may simply reflect the fact that the group was of higher ability
to begin with and there was less room for improvement on this particular test
in comparison to the other groups.³ Thus, had the groups been of equal
ability at the beginning of the year, the analysis of gain scores might have
led to quite a different conclusion.
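
The gain score comparison described above amounts to a one-way analysis of
variance on posttest-minus-pretest differences. A minimal sketch in Python
follows. The function name `one_way_anova_f` and the toy scores are
hypothetical, since the individual pupil scores are not reported in this
paper.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists of gain scores, one list per intact group.
    """
    all_scores = [g for grp in groups for g in grp]
    grand = mean(all_scores)
    k, n_total = len(groups), len(all_scores)
    # Between-groups sum of squares: group means around the grand mean.
    ss_between = sum(len(grp) * (mean(grp) - grand) ** 2 for grp in groups)
    # Within-groups sum of squares: scores around their own group mean.
    ss_within = sum((g - mean(grp)) ** 2 for grp in groups for g in grp)
    df_b, df_w = k - 1, n_total - k
    return (ss_between / df_b) / (ss_within / df_w), df_b, df_w

# Hypothetical pretest and posttest scores for three small intact groups.
pre = [[90, 95, 93], [60, 65, 61], [50, 52, 48]]
post = [[91, 97, 96], [64, 70, 67], [57, 60, 57]]
gains = [[b - a for a, b in zip(p, q)] for p, q in zip(pre, post)]

f_ratio, df_between, df_within = one_way_anova_f(gains)
print(f"F({df_between}, {df_within}) = {f_ratio:.2f}")
```

The F ratio is then referred to the F distribution with the printed degrees
of freedom. Nothing in the computation takes account of where each group
started, which is the limitation the text goes on to discuss.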

Analysis of Covariance

Rather than attempt to explain to a decision maker that "what you see
is not what you get," due to pre-existing differences between groups, you may
decide to use the analysis of covariance (ANCOVA) to "... make adjustments
for the effects of the uncontrolled variables in comparing group
performance" (Tatsuoka, 1971, p. 40). In this example, ANCOVA was used to
"control" for pre-existing differences in reading ability as measured by the
pretest. Essentially, the analysis of covariance adjusts the groups' mean
scores on the dependent variable(s) or posttest scores as a function of the
groups' performance on the covariate. The slope of the regression line of the
posttest on the pretest is used to make the "appropriate" adjustment, and it
must be assumed that the slope of the regression does not differ significantly
across groups. (A conservative level of α should be used in this test since
the objective is to show that the null is tenable.)

³ If the groups had been formed on the basis of the pretest scores, the
regression toward the mean phenomenon might also be used to explain such
a result. Such was not the case with these data.

When the ANCOVA was applied to the reading scores, the assumption
indicated above was well satisfied (p ≥ .90) and the differences among the
adjusted posttest means were found to be significant (p ≤ [illegible]). A
consideration of the adjusted posttest mean scores, which are also presented
in Table 1, suggests that, after initial differences in ability are adjusted,
the reading program used with group III was not nearly as effective as those
used with groups I and II. Before attempting to convince the decision maker
that the results of the ANCOVA are to be believed over the ANOVA on gain
scores, one might wish to consider a modified ANCOVA.
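
The adjustment described above can be made concrete with the group means in
Table 1. The sketch below applies the usual ANCOVA adjustment, in which each
adjusted mean equals the posttest mean minus the slope times the distance of
the group's pretest mean from the grand pretest mean. Two inputs are
assumptions rather than values reported in the paper: the pooled within-group
slope `B_WITHIN = 0.547` is back-calculated from the tabled adjusted means,
and each pretest mean is taken as the posttest mean minus the raw gain.

```python
# Group summaries from Table 1: (n, posttest mean, raw gain).
GROUPS = {
    "I": (46, 105.52, 12.81),
    "II": (60, 91.00, 28.48),
    "III": (19, 74.11, 23.58),
}

# Assumption: pooled within-group slope of posttest on pretest,
# back-calculated from Table 1 (it is not reported in the paper).
B_WITHIN = 0.547

def adjusted_means(groups, slope):
    """ANCOVA-adjusted posttest means from group summary statistics."""
    # Assumption: pretest mean = posttest mean - raw gain.
    pre = {g: post - gain for g, (_, post, gain) in groups.items()}
    n_total = sum(n for n, _, _ in groups.values())
    grand_pre = sum(n * pre[g] for g, (n, _, _) in groups.items()) / n_total
    return {
        g: post - slope * (pre[g] - grand_pre)
        for g, (_, post, _) in groups.items()
    }

ancova = adjusted_means(GROUPS, B_WITHIN)
# With a slope of 1, each adjusted mean is the raw gain plus the same
# constant (the grand pretest mean), so the ordering matches the gains.
gain_like = adjusted_means(GROUPS, 1.0)

for g in GROUPS:
    print(f"Group {g}: ANCOVA {ancova[g]:.2f}, slope=1 {gain_like[g]:.2f}")
```

With the assumed slope, the adjusted means come out near 94.1, 96.1, and
85.7, which agrees with the raw score ANCOVA column of Table 1 to within
rounding. Setting the slope to 1 ranks the groups exactly as the raw gains
do, which is one way to see why the two analyses can point in different
directions.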

ANCOVA Using Estimated True Scores 

Lord (1963) has pointed out that "making allowances for initial
differences among groups on a poor measure of some variable is not the same
thing as making allowances for initial differences in the variable itself."
The procedure suggested by Lord (1960) to overcome this problem requires
the administration of the same pretest twice in order to arrive at the
estimated true scores. Since even in the best of situations, it would be
rare to be able to administer two pretests to all subjects in all groups,
a "reasonable" alternative to this was taken in order to obtain estimated
true scores on the pretest for the data reported herein.

By using the classical measurement assumption that the standard error of
measurement is constant over the ability range measured by a test, it was
possible to estimate the reliability of the test in this setting by
substituting the standard error of measurement provided by the test manual in
the following formula:

\[ s_{\mathrm{meas}} = s_x \sqrt{1 - r_{xx}} , \]

substituting the pooled standard deviation of the pretest scores
for $s_x$ and solving for $r_{xx}$. Using this estimate of $r_{xx}$, the
estimated true scores were computed using:

\[ T = \bar{X} + r_{xx} (X - \bar{X}) , \]

where $T$ is the estimated true score, $X$ is the observed score, and
$\bar{X}$ is the mean of the group to which each subject belongs. In words,
each person's score was regressed toward the mean of his group as a function
of the estimated reliability. When the estimated true scores so determined
were entered into the usual analysis of covariance procedures, slight
differences were observed in the adjusted posttest scores as shown in Table 1.
In this application, the effect was small since the composite total reading
scores were already highly reliable. With less reliable covariates, however,
the use of estimated true scores may substantially alter the results of the
ANCOVA (Lord, ...).


The use of ANCOVA using estimated true scores was included here since
it is recommended by Porter and Chibucos (1974) as the preferred method of
analysis with data from non-equivalent control group designs.
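
The two steps described in this section, solving the standard error of
measurement formula for the reliability and then regressing each observed
score toward its own group mean, can be sketched as follows. The function
names and all numeric inputs are hypothetical. The paper does not report the
manual's standard error of measurement or the individual scores.

```python
from statistics import mean

def reliability_from_sem(sem, sd):
    """Solve s_meas = s_x * sqrt(1 - r_xx) for r_xx."""
    return 1.0 - (sem / sd) ** 2

def estimated_true_scores(scores, r_xx):
    """Regress each observed score toward the mean of its own group."""
    group_mean = mean(scores)
    return [group_mean + r_xx * (x - group_mean) for x in scores]

# Hypothetical inputs for illustration only.
r_xx = reliability_from_sem(sem=3.0, sd=10.0)
observed = [40, 50, 60]              # one group's pretest scores
true_est = estimated_true_scores(observed, r_xx)

print(round(r_xx, 2), [round(t, 1) for t in true_est])
```

Because each score is pulled toward its own group mean, the group means of
the estimated true scores equal the observed group means. Only the
within-group spread shrinks, which steepens the within-group regression slope
used in the subsequent ANCOVA.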

A Comparison of Gain Score ANOVA vs ANCOVA 

The fact that the gain score analysis and the analyses of covariance
reported above give seemingly contradictory results is not an artifact of
these particular data, but can be expected to occur routinely with data from
non-equivalent control group designs. The analysis of covariance simply
anticipates and adjusts scores so as to account for the phenomenon referred
to as regression toward the mean.⁴ When any group is measured twice on the
same variable, there will be a tendency for the high or low scoring individuals
(or subgroups) to regress toward the mean unless everyone earned the same
score on both occasions. Note that a person's score is regressed toward
the mean of the group of which he is a member or can be assumed to be a
member. It does not make sense to regress a person's score toward the mean of
a group if he could not reasonably be assumed to belong to the group. The
latter, however, is essentially what the analysis of covariance does when
it is applied to data like that reported here. Only if the pretest means do
not differ significantly is it reasonable to regress these means toward a
common population mean. Lord (1967) offers a vivid illustration of the perils
associated with using ANCOVA when a single treatment is applied to samples
drawn from two distinct populations. The point made by Lord can be illustrated
with the present example by considering what would happen if the pretest
had been given in May and the posttest had been given the following October.
In such a situation, some pupils would be expected to gain over the summer
and others would lose, but it might be reasonable that by October, the
pretest and posttest means would be nearly the same within each ability group.
Such an outcome was approximated with the present data by subtracting from
each person's posttest score an amount equal to the difference between the
pretest and posttest means for his group. Making the pretest and posttest
mean scores equal within each group, while the individual scores are free to
change, represents a condition Lord refers to as dynamic equilibrium. An
analysis of covariance was then applied with the result that, after
"controlling" for the pretest differences, the groups were found to differ
significantly (p < .001). The adjusted posttest means are shown in Table 1.
Inasmuch as no group gained or lost, it may be a bit awkward trying to explain
to the decision maker how the summer had a significantly more favorable effect
on the high ability group in comparison with the average and low ability
groups. Notice that each group was exposed to the same treatment, summer. When
each group is exposed to a different treatment, the explanation becomes even
more tedious, if not absurd.

⁴ Low reliability may contribute to the regression effect but even if
perfectly reliable measurements are taken, the regression effect should be
anticipated as long as the correlation between pre- and post- scores is less
than perfect.
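
The dynamic equilibrium exercise can be retraced from the group summaries in
Table 1. Subtracting a constant from every posttest score in a group leaves
the within-group slope unchanged, so the same adjustment applies with each
posttest mean set equal to its pretest mean. This is a sketch under stated
assumptions: the slope `B_WITHIN = 0.547` is back-calculated from Table 1
rather than reported in the paper, and each pretest mean is taken as the
posttest mean minus the raw gain.

```python
# (n, posttest mean, raw gain) for each group, from Table 1.
GROUPS = {
    "I": (46, 105.52, 12.81),
    "II": (60, 91.00, 28.48),
    "III": (19, 74.11, 23.58),
}
B_WITHIN = 0.547  # assumed pooled within-group slope (back-calculated)

# Assumption: pretest mean = posttest mean - raw gain.
pre = {g: post - gain for g, (_, post, gain) in GROUPS.items()}
n_total = sum(n for n, _, _ in GROUPS.values())
grand_pre = sum(GROUPS[g][0] * pre[g] for g in GROUPS) / n_total

# Dynamic equilibrium: each group's posttest mean equals its pretest mean,
# so the ANCOVA adjustment is applied to the pretest mean itself.
equilibrium_adjusted = {
    g: pre[g] - B_WITHIN * (pre[g] - grand_pre) for g in GROUPS
}

for g, m in equilibrium_adjusted.items():
    print(f"Group {g}: pre = post = {pre[g]:.2f}, adjusted = {m:.2f}")
```

Although no group gained or lost, the adjusted means still differ by roughly
19 points between groups I and III, in line with the last adjusted column of
Table 1 to within rounding.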

Studies which provide data of the type reported here may have been designed
as single factor studies, but, by default, become two factor studies when the
groups are found to differ significantly prior to treatment. It is impossible
to disentangle the effects of the two factors unless each treatment is applied
to each ability group. Moreover, the analysis of covariance cannot eliminate
the confounding of the ability factor by making equal that which God made
unequal. It is for these reasons that the writers recommend against the use
of ANCOVA to analyze data of this type. Such advice is consistent with that
offered by some (e.g., Elashoff, 1969; Cronbach and Furby, 1970; Lord, 1963,
1967), but is counter to the advice offered by others (Campbell and Stanley,
1963, p. 49; Ferguson, 1971, p. 288; Tatsuoka, 1971). Nor will use of
estimated true scores resolve the difficulty because the problems indicated
above hold even for perfectly reliable measures on the covariate. (Note that
in the procedure outlined above for getting the estimated true scores, the
observed scores were regressed toward the means of the respective groups, not
toward the grand mean.)

The analysis of gain scores should be preferred over analysis of
covariance in non-equivalent control group designs if only because it is more
easily understood and requires fewer assumptions. Once regression toward a
common mean is eliminated from consideration, what factors, if any, argue
against a straightforward interpretation of gain scores? One factor which
seems to have been operative in the present study was an artificial ceiling
effect associated with the use of this particular test. This effect is
evident by the fact that the mean posttest score for the high ability group
(mean = 105.5) was close to the maximum possible score (119). Were it not for
this ceiling effect, it might be reasonable to expect the high ability group
to maintain or increase their superiority by gaining the most. While this
effect works in the opposite direction of the ceiling effect, it is not
reasonable to assume the two will balance each other. The only interpretation
that can be drawn from a gain score ANOVA as reported here is that the
amount of gain was significantly different when program I was used with
"high" ability pupils, program II was used with "average" ability pupils, and
program III was used with "low" ability pupils.
Alternative Analyses

While it does not seem reasonable to make comparisons across treatments,
it may be of interest to consider within treatment comparisons. For example,
it may be of interest to test whether the mean gain observed within a
particular program represents a statistically significant gain. The dependent
t-test would be appropriate for such a test. When the dependent t-test was
applied to the data reported here, the t-values were all highly significant.
The inference to be made from these tests is with reference to subsequent
samples drawn at random from the three distinct ability populations associated
with each group. In this application, the gains within all three groups were
so large that a statistical test of the null hypothesis may seem trivial. It
may, however, be of interest to test whether the observed gain is
significantly different from some a priori expectation of gain based on
practical or theoretical considerations, rather than to test against a null
hypothesis of zero gain. One of the more interesting proposals in this regard
is that by Bryk and Weisberg (1976), called Value-Added Analysis. Very
briefly, the pre- and posttests are viewed as snapshots of an on-going
developmental process and chronological age (or some other variable) is
regressed onto the pretest scores to provide an estimate of the growth that
might be expected without special intervention. Unfortunately, when this
strategy was applied to the data reported here, it was found that the
regression of age on pretest scores was nearly zero, which argued against the
use of this new analysis.⁵

⁵ This outcome was quite disappointing since it seemed reasonable to find some
relationship between chronological age and reading ability for children in
"regular" third grade classes.

In certain situations, it may be of interest simply to determine whether the
observed mean gain could be attributed to errors in measurement alone. For
example, if a test had been administered to this audience before and
again after this presentation, it might be of interest to determine whether
the observed gain (regardless of sign!) represents a difference larger than
what might be attributable to error. The standard error associated with
sampling error is not appropriate in this case since we do not wish
to generalize beyond this audience. The procedures outlined by
Davis (1964) to estimate the standard error of measurement of the mean change
should be used rather than the usual dependent t-test. While the standard
error of mean change is conceptually distinct from the standard error of
measurement of mean change, operationally the distinction can be lost.
Specifically, the variance of the difference scores used in the dependent
t-test can also be taken as an estimate of the variance error of measurement
if an additive treatment model can be assumed (Rulon, 1941; Overall and
Woodward, 1975). Viewed in this way, it seems inappropriate to use the results
of a dependent t-test to make inferences regarding subsequent samples drawn
from the same population since the standard error would reflect measurement
error and not sampling error.
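
The within-group tests discussed in this section share one computation: the
mean gain, less whatever gain is hypothesized, divided by the standard error
of the mean gain. A minimal sketch follows. The function name `dependent_t`,
the `expected_gain` parameter, and the five pairs of scores are hypothetical
and serve only to illustrate the arithmetic.

```python
from math import sqrt
from statistics import mean, stdev

def dependent_t(pre, post, expected_gain=0.0):
    """Dependent (paired) t statistic for the mean gain.

    `expected_gain` is the hypothesized mean gain: zero for the usual
    null hypothesis, or an a priori expectation of growth.
    Returns (t, degrees of freedom).
    """
    gains = [b - a for a, b in zip(pre, post)]
    # Standard error of the mean gain, from the variance of the gains.
    se = stdev(gains) / sqrt(len(gains))
    return (mean(gains) - expected_gain) / se, len(gains) - 1

# Hypothetical scores for five pupils in one program.
pre = [50, 52, 47, 60, 55]
post = [53, 57, 51, 66, 57]

t_zero, df = dependent_t(pre, post)                      # H0: no gain
t_expected, _ = dependent_t(pre, post, expected_gain=3.0)
print(df, round(t_zero, 3), round(t_expected, 3))
```

The standard error here is built from the variance of the gain scores, the
same quantity that the text notes can be read as a measurement error variance
under an additive treatment model. The arithmetic is identical under either
reading. What changes is the inference the result can support.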

Summary 

If it is necessary to go beyond simple descriptive statistics resulting
from non-equivalent group designs, statistical comparisons across groups
should be avoided. Within-group comparisons are logically consistent, but
whether the results should be used to make a statistical inference or a
measurement inference must be considered. If these recommendations do not
offer much help for analyzing data of this type, so be it. Perhaps it is
time for action researchers to educate decision makers regarding the
importance of random assignment to groups if statistical tests are to aid in
the evaluation of alternative programs. If all else fails, the descriptive
statistics can be scrutinized carefully and can provide a basis for judgment in
much the same manner people choose spouses. If it later turns out that the
decision was in error, at least it wasn't the fault of the action researcher
who inappropriately used the analysis of variance.
Table 1

Summary Statistics of Non-equivalent Group Data

                     Pretest          Posttest       Raw    Adjusted  Adjusted  Adjusted
Group       n      Mean     s       Mean     s       Gain    Mean*     Mean**    Mean***

Group I     46     92.      --     105.52   11.52   12.81    94.09     93.74     81.29
Group II    60     62.52   23.57    91.00   17.07   28.48    96.08     96.24     67.60
Group III   19     50.53   18.28    74.11   18.79   23.58    85.74     86.10     62.17

*Based on raw score ANCOVA
**Based on estimated true score ANCOVA
***Based on ANCOVA with equal pre- and posttest means within groups